Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems
Abstract
Fail-slow hardware is an under-studied failure mode. We present a study of 101 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 12 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the cascading chains of root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
Similar Resources
Post-Triassic normal faulting and extensional structures in Central Alborz, Northern Iran
This paper presents structural evidence of extensional activity in Central Alborz during the Mesozoic. Structural evidence of homogeneous early-stage stretching, such as layer-parallel to oblique boudinage of Permian and Triassic rocks in various portions of the study area, accompanied by extensional-fibrous fractures, was followed by more advanced extensional features. These extensional structu...
Development of system decision support tools for behavioral trends monitoring of machinery maintenance in a competitive environment
The article centres on the development of a software system for a manufacturing company that produces polyethylene bags using mostly conventional machines, in a competitive world where each business enterprise desires to stand tall. The system is meant to assist in gaining market share and in making maintenance and production decisions through the dynamism and flexibility embedded in the package as customers’ demand ...
Generation Scheduling in Large-Scale Power Systems with Wind Farms Using MICA
The growth in demand for electric power and the rapid increase in fuel costs worldwide create a need to discover new energy resources for electricity production. Among the nonconventional resources, wind and solar energy are known as the most promising for electricity production in the future. In this thesis, we study long-term generation scheduling of power systems in the pre...
Communication-efficient Outlier Detection for Scale-out Systems
Modern scale-out services are built on top of large datacenters composed of thousands of individual machines. These must be continuously monitored because unexpected failures can overload fail-over mechanisms and cause large-scale outages. Such monitoring can be accomplished by periodically measuring hundreds of performance metrics and looking for outliers, often caused by misconfigurations, har...
FAIL-FCI: Versatile fault injection
One of the topics of paramount importance in the development of Grid middleware is the impact of faults, since their probability of occurrence in a Grid infrastructure and in large-scale distributed systems is actually very high. In this paper, we explore the versatility of a new tool for fault injection in distributed applications: FAIL-FCI. In particular, we show that not only are we able to ...